Independence

What’s wrong with this model?

  • DV: County level GOP vote share

  • IV: Number of years after founding that the state joined the union

library(tidyverse)
library(rvest)
library(tidycensus)

# county level election results
counties_24<-read_csv('https://raw.githubusercontent.com/tonmcg/US_County_Level_Election_Results_08-24/refs/heads/master/2024_US_County_Level_Presidential_Results.csv')

# list of states by date of joining the union
page<-read_html('https://en.wikipedia.org/wiki/List_of_U.S._states_by_date_of_admission_to_the_Union')

join_date<-page|>
  html_element(css='table')|>
  html_table()|>
  select(2:4)|>
  mutate(datestring = str_extract(`Date(admitted or ratified)`, "([A-Z][a-z]+ [0-9]{1,2}, [0-9]{4})"),
         date = mdy(datestring)
  )
counties<-counties_24|>
  left_join(join_date, by=join_by(state_name==State))|>
  mutate(join_year = year(date) - 1787,
         percent_gop = per_gop * 100
         
         )|>
  drop_na(join_year)

model<-lm(percent_gop ~ join_year, data=counties)

What’s wrong with this model?

Here’s what our output looks like:

DV: County GOP vote % in 2024

                         model 1
State year joined union  0.044
                         [0.030, 0.058]
Num.Obs.                 3152
R2 Adj.                  0.012
BIC                      26309.4

note: constant = 64.7

What’s wrong with this model?

We can get a better sense of the problem by visualizing the relationship

What’s wrong with this model?

Finally, here’s what it looks like when we plot the model residuals on a map of the US:

Regression Assumptions

Some important regression assumptions:

  • Linearity in parameters

  • No perfect multicollinearity

  • Constant Variance

Heteroskedasticity is one way to violate the constant variance assumption, but another way to violate it is through clustering.
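Here is a quick simulated sketch of the clustering version of the problem (hypothetical data, not from the slides): each group shares a common shock, so residuals within a group are correlated even though every observation's error variance is constant.

```r
set.seed(42)
g <- rep(1:50, each = 20)          # 50 clusters of 20 observations each
cluster_shock <- rnorm(50)[g]      # one shared shock per cluster
x <- rnorm(1000)
y <- 1 + 0.5 * x + cluster_shock + rnorm(1000)
m <- lm(y ~ x)
r <- resid(m)
# under independence, cluster means of the residuals would have
# sd roughly sd(r)/sqrt(20); the shared shocks make them far more variable:
sd(tapply(r, g, mean))
sd(r) / sqrt(20)
```

The first number comes out several times larger than the second, which is exactly the within-group dependence that naive standard errors ignore.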

Sources of dependence: Sampling strategies

  • Surveys typically target strata rather than using true random sampling

  • Researchers may target neighborhoods or census tracts rather than individuals when conducting a field experiment, and they’ll often survey an entire household rather than a single individual.

  • Weighting accounts for the non-random selection, but survey analysts also need to account for the lack of independence within strata to calculate standard errors.

[Figure: stratified sampling hierarchy — population strata (urban metro, county, group of counties) → sampled metro areas and counties → sampled census blocks → sampled households]

Sources of dependence: spatial autocorrelation

Things close to each other will tend to correlate, so even if we use only county-level covariates, we might still have some dependence just due to proximity:

From Manuel Gimond
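As a toy illustration (not from the slides), Moran's I — the usual summary statistic for spatial autocorrelation — can be computed by hand from a neighbor weight matrix:

```r
set.seed(1)
n <- 30
vals <- cumsum(rnorm(n))     # a smooth "surface": neighboring units look alike
W <- matrix(0, n, n)
for (i in 1:(n - 1)) {       # adjacency weights: units i and i+1 are neighbors
  W[i, i + 1] <- 1
  W[i + 1, i] <- 1
}
z <- vals - mean(vals)
# Moran's I: (n / sum of weights) * weighted cross-products / total variation
moran_i <- (n / sum(W)) * sum(W * outer(z, z)) / sum(z^2)
moran_i                      # clearly positive: nearby values move together
```

Values near +1 indicate strong positive spatial autocorrelation; values near 0 indicate a spatially random pattern.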


Sources of dependence: time series and panel data

Most things are correlated with their own past values (temporal autocorrelation) and observing the same case multiple times doesn’t necessarily mean you’ve collected totally new data:

Sources of dependence: Hierarchical data

Hierarchical data come from multiple levels of analysis, so the same higher-level observation is duplicated across many rows. This comes up a lot if you want to study things like:

  • The effects of immigration levels on individual public opinion

  • The effect of electoral systems on the characteristics of legislators

  • The effect of classroom size on student outcomes

  • The average effect of a treatment across multiple studies

Why it matters

  • Primary issue: unrealistically small standard errors, which means we underestimate uncertainty (small p-values, overly narrow confidence intervals, etc.)

  • We have far fewer independent observations than the number rows in our data might imply.

    • There are over 3,000 counties, but the state-level variable takes only 50 distinct values, each duplicated across that state’s counties.
  • We can also have spurious results from certain patterns of autocorrelation: if two variables are increasing over time, they’ll look correlated in a time series regression.

  • Groups themselves can be confounders.
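The spurious-correlation point is easy to demonstrate by simulation (this sketch is illustrative, not from the slides): regressing one random walk on another, independently generated one rejects the null far more often than the nominal 5% rate, while first-differencing roughly restores it.

```r
set.seed(123)
# rate at which a naive regression of one random walk on an
# independent random walk "finds" a significant relationship:
spurious_rate <- mean(replicate(200, {
  a <- cumsum(rnorm(100))          # trending series 1
  b <- cumsum(rnorm(100))          # independent trending series 2
  summary(lm(a ~ b))$coefficients["b", "Pr(>|t|)"] < 0.05
}))
spurious_rate                      # far above 0.05

# first-differencing removes the trends and the spurious result:
diff_rate <- mean(replicate(200, {
  a <- cumsum(rnorm(100))
  b <- cumsum(rnorm(100))
  summary(lm(diff(a) ~ diff(b)))$coefficients["diff(b)", "Pr(>|t|)"] < 0.05
}))
diff_rate                          # close to the nominal 0.05
```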

Some Strategies

As usual, the specifics matter, and we need to know how observations are correlated before we can really address this problem. But once we know that, some options are:

  • Aggregation

  • Cluster robust standard errors

  • Fixed Effects

  • Random effects/Mixed models

  • Differencing or lagging (for time series)

How big is the problem?

One way to get a sense of the scope of the problem is to calculate an effective sample size and design effect for the clusters.

The ESS is a rough estimate of the number of independent observations that would provide the same amount of information as our clustered data; the design effect is the factor by which clustering inflates the variance.
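In symbols (these are the standard Kish design-effect formulas, with \(\rho\) the intra-cluster correlation and \(\bar{m}\) the average cluster size):

\[ \rho = \frac{SS_{between}}{SS_{between} + SS_{within}}, \qquad DEFF = 1 + (\bar{m} - 1)\rho, \qquad ESS = \frac{n}{DEFF} \]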

ca<-counties|>
  mutate(grand_mean = mean(percent_gop))|>
  group_by(state_name)|>
  mutate(group_mean = mean(percent_gop))|>
  ungroup()

n_u<-mean(table(counties$state_name))  # average cluster (state) size

fx<-ca|>
  summarize(ssb = sum((group_mean - grand_mean)^2), # between clusters
            ssw = sum((percent_gop - group_mean)^2), # within clusters
            sst = ssb + ssw,             # total sum of squares
            icc = ssb/sst,               # intra-cluster correlation
            deff = 1 + (n_u - 1) * icc,  # design effect
            ess = n() / deff             # effective sample size
            )
fx

Aggregation

In the election model, we could just aggregate everything up to the state level so we no longer have nested data:

state_data<-counties|>
  ungroup()|>
  group_by(state_name)|>
  summarise(state_total = sum(total_votes),
            join_year = unique(join_year),
            perc_gop = sum(votes_gop)/state_total * 100
            
            )


model_2<-lm(perc_gop ~ join_year, data=state_data)
DV: County GOP vote % in 2024

                         Original         Aggregated
State year joined union  0.044            0.053
                         [0.030, 0.058]   [-0.006, 0.113]
Num.Obs.                 3152             50
R2 Adj.                  0.012            0.044

note: constant = 64.7

Aggregation

  • Good: This fixes the state-level autocorrelation (although we might still be concerned about regional autocorrelation!)

  • Good: Once we’ve done this, we just estimate a linear model

  • Bad: we lose a lot of nuance!

    • We can’t include any county-level covariates in the model
    • Counts all observations equally (here this might make sense, but maybe not in all cases)
    • Potentially throws out useful information about precision (some estimates should be seen as more precise because we have more data about them)

Standard Error Adjustments

If we have an idea about the nature of the clustering, we can potentially model it and adjust our standard errors accordingly.

Clustered Bootstrap

  • Bootstrapped Standard Errors: repeatedly re-sample with replacement from the original data, calculating the same statistic after each step, then measure the variance in those resampled statistics

  • Clustered Bootstrap: repeatedly resample each cluster with replacement.

Clustered Bootstrap

set.seed(999)
r <- 3000
states <- unique(counties$state_name)
coefs <- matrix(0, nrow = r, ncol = 2)

for (i in 1:r) {
  # resample whole states, with replacement
  clusters <- sample(states, size = length(states), replace = TRUE)
  # keep every county row belonging to each sampled state
  index <- unlist(sapply(clusters, function(x) which(counties$state_name == x)))
  county_boot <- counties[index, ]
  bootmodel <- lm(percent_gop ~ join_year, data = county_boot)
  coefs[i, ] <- coef(bootmodel)
}


cbind(colMeans(coefs),apply(coefs, 2, sd))
            [,1]       [,2]
[1,] 64.61350050 2.20216954
[2,]  0.04436051 0.03513735

Clustered Bootstrap

DV: County GOP vote % in 2024

                         No clustering    Clustered Bootstrap
State year joined union  0.044            0.044
                         [0.030, 0.058]   [-0.026, 0.113]
                         (0.007)          (0.035)
Num.Obs.                 3152             3152
R2 Adj.                  0.012            0.012
BIC                      26309.4          26309.4
Std.Errors               IID              Bootstrap

note: Standard errors in parentheses, 95% CI in brackets

Cluster Robust Errors

\[\mathbb{V}[\beta] = \frac{\hat{\epsilon}' \hat{\epsilon}}{n - k} (X'X)^{-1}\]

# make a matrix with an intercept and join_year
X<-model.matrix(percent_gop ~ join_year, data = counties)
y <- counties$percent_gop
n <- nrow(X) # number of observations
k <- ncol(X) # number of parameters (including intercept)

# calculate regression coefficients
coefs<- solve(crossprod(X)) %*% crossprod(X,y) 

# Standard SE:
# predictions
y_pred <- X %*% coefs
# residuals
u <- y - y_pred

# calculate the variance-covariance matrix for the betas
sigma<-sum(u^2) / (n-k)
vcov<- sigma *  solve(crossprod(X))
# the square root of the diagonals give the standard errors
regular_se<-sqrt(diag(vcov))
# model output:
cbind(coefs, regular_se)
                         regular_se
(Intercept) 64.69256397 0.438844128
join_year    0.04382225 0.007107679

Cluster Robust Errors

\[ \hat{\Omega} = \frac{n-1}{n-k}\frac{C}{C-1} \sum_{c=1}^C (X_c' \hat{\epsilon}_c \hat{\epsilon}_c' X_c) \]

\[ \mathbb{V}[\beta] = (X' X)^{-1} \, \hat{\Omega} \, (X' X)^{-1} \]

# Clustered SE:
# The clustering ----
states<-unique(counties$state_name)
n_states <- length(states)
# create the empty list
list_of_vcovs <- vector(mode = "list", length = n_states)
# naming the list elements
names(list_of_vcovs) <- states
for(state in states) {
  # get index of each state observation
  cluster_index <- which(counties$state_name == state)
  # now we can subset our data by those indices..
  X_cluster = X[cluster_index,]
  # get residuals for this cluster
  u_c <- u[cluster_index]
  # then do the calculation u_c u_c'
  vcov_c <- t(X_cluster) %*% tcrossprod(u_c) %*% X_cluster
  # add this matrix to the list
  list_of_vcovs[[state]] <- vcov_c
}
# add the separate variance covariance matrices together
omega <- n_states/(n_states-1) * (n-1)/(n-k) * Reduce("+",list_of_vcovs)
# variance-covariance matrix
vcov <- (solve(crossprod(X)) %*% omega %*% solve(crossprod(X))) 
# standard errors
se_clustered <- sqrt(diag(vcov))
# print the coefficients and standard errors
cbind(coef(model), regular_se, se_clustered)
                         regular_se se_clustered
(Intercept) 64.69256397 0.438844128   2.15122512
join_year    0.04382225 0.007107679   0.03489429

Cluster Robust Errors

mlist<-list('No clustering' = model, 
                  'Clustered Bootstrap' = model,
                  'Clustered SE' = model
                  )
modelsummary(mlist,
 estimate  = "{estimate}",  
  statistic = c("conf.int", 'std.error'),
 conf_level = .95,        
 coef_map = list('join_year' = 'State year joined union'
                 ),
 note = "note: Standard errors in parentheses. 95% ci in brackets",
 gof_omit = 'F|RMSE|R2$|AIC|Log.Lik.',
 title= 'DV: County GOP vote % in 2024',
 vcov = c("classical", "bootstrap", ~state_name),
 R = 3000,
 cluster=~state_name)
DV: County GOP vote % in 2024

                         No clustering    Clustered Bootstrap   Clustered SE
State year joined union  0.044            0.044                 0.044
                         [0.030, 0.058]   [-0.026, 0.113]       [-0.025, 0.112]
                         (0.007)          (0.035)               (0.035)
Num.Obs.                 3152             3152                  3152
R2 Adj.                  0.012            0.012                 0.012
BIC                      26309.4          26309.4               26309.4
Std.Errors               IID              Bootstrap             by: state_name

note: Standard errors in parentheses. 95% CI in brackets

Cluster Robust Errors

Note that the coefficient estimates are the same across all three versions of this model. All that changes are the confidence intervals:

modelplot(mlist, 
          vcov = c("classical", "bootstrap", ~state_name),
          cluster = ~state_name,
          coef_omit='Intercept'
          )

Standard Error Adjustments

  • Good: easy to use, as long as you know the nature of the clustering.

  • Good: in theory there’s little cost: if the groups exhibit no clustering, cluster-robust standard errors should look much like the classical ones.

  • Bad: clustering doesn’t account for group-level characteristics that might impact \(\hat{\beta}\). It only adjusts the standard errors.

Fixed Effects Models

  • What if we want to account for some unobserved group-level confounding? Clustering only adjusts errors, but if there’s group-level confounding, we might also want to adjust our coefficient estimates.

Fixed Effects Models

Theory: parties in power want to be seen as competent, so they’ll be less likely to emphasize valence issues like corruption.

  • Data: 2019 round of the Chapel Hill Expert Survey for Europe

  • Unit of Analysis: political party

  • IV: whether the party is in government

  • DV: corruption salience

We have clustering within countries, but we might also suspect that country-level characteristics like the level of actual corruption could act as confounders here.


Call:
lm(formula = corrupt_salience ~ in_gov, data = ches)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.8971 -1.9705  0.0293  1.6207  5.3463 

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)           4.7543     0.1774  26.796   <2e-16 ***
in_govIn government  -0.7672     0.3253  -2.358   0.0192 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.265 on 230 degrees of freedom
Multiple R-squared:  0.02361,   Adjusted R-squared:  0.01936 
F-statistic: 5.561 on 1 and 230 DF,  p-value: 0.0192

Fixed Effects Models

Countries have systematic differences with respect to the dependent variable, and these differences are probably related to characteristics that we haven’t measured here:

ches|>
  group_by(country)|>
  mutate(corruption_mean = mean(corrupt_salience),
         )|>
  ungroup()|>
  mutate(country = reorder(country, corruption_mean))|>
  ggplot(aes(x=country, y=corrupt_salience)) + 
  geom_point() +
  geom_point(aes(x=country, y=corruption_mean), color='red') +
  coord_flip() +
  theme_bw() +
  labs(x='Country', y='Corruption salience', 
       caption='red = country mean')

Fixed Effects Models

\[ Y_{ij} = \beta_1 X_{ij} + \alpha_j + \epsilon_{ij} \]

  • \(Y_{ij}\) Outcome for party i in country j

  • \(X_{ij}\) IV for party i in country j

  • \(\alpha_j\) A vector of country-specific intercept terms

  • \(\epsilon_{ij}\) error

Fixed Effects Models

One way to think about this approach is that it amounts to mean-centering all of our covariates at the group-level. This effectively controls for country-level differences.

ches|>
  group_by(country)|>
  mutate(corruption_mean =  mean(corrupt_salience),
         corrupt_salience = corrupt_salience -  corruption_mean 
         )|>
  ungroup()|>
  mutate(country = reorder(country, corruption_mean))|>
  ggplot(aes(x=country, y=corrupt_salience)) + 
  geom_point() +
  coord_flip() +
  theme_bw() +
  labs(x='Country', y='Corruption salience',  caption='red = country mean') +
  geom_hline(yintercept =0, lty=2, col='red')

Fixed Effects Models

Basic procedure:

  • Create an indicator variable for every “cluster”. i.e.: one per country or one per state

  • Include N-1 indicators in a regression model alongside your variables of interest

  • Note: This effectively controls for fixed characteristics of the grouping variable, including characteristics you didn’t measure. So everything should be interpreted as an average effect relative to the group mean

  • Alternatively: subtract the group-level mean of every IV and the DV from each observation, then run the model on these centered variables.
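A minimal simulated check (hypothetical data, not from the slides) that the dummy-variable and demeaning procedures give the same slope, and that both remove the group-level confounding that biases the naive model:

```r
set.seed(7)
g <- factor(rep(1:10, each = 30))
alpha <- rnorm(10)[as.integer(g)]   # unobserved group-level confounder
x <- rnorm(300) + alpha             # covariate correlated with the group effect
y <- 2 * x + 5 * alpha + rnorm(300) # true slope on x is 2

naive <- coef(lm(y ~ x))["x"]       # biased well above 2 by the confounder
fe    <- coef(lm(y ~ x + g))["x"]   # group indicators (fixed effects)
x_c   <- x - ave(x, g)              # demeaning: subtract group means
y_c   <- y - ave(y, g)
dm    <- coef(lm(y_c ~ x_c))["x_c"]
c(naive, fe, dm)                    # fe and dm are identical; naive is not
```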

Fixed Effects Models

The actual output here is a little unwieldy, and so the fixed effects terms are usually excluded from output:

fe_model<-lm(corrupt_salience ~ in_gov + country,  data = ches) 
broom::tidy(fe_model)

Fixed Effects Models

library(fixest)

original_model<-lm(corrupt_salience ~ in_gov, data = ches) 
clustered_model<-feols(corrupt_salience ~ in_gov, data=ches, cluster=~country)
fe_model<-feols(corrupt_salience ~ 1+ in_gov |country, data=ches)

coefs<-c('in_govIn government' = 'Party in government',
         'in_gov' = 'Party in government'
         )
mlist<-list('original'= original_model, 
            'clustered'= clustered_model,
            'fixed effects'= fe_model
            )

modelsummary(mlist,
 estimate  = "{estimate}",  
  statistic = c("conf.int", 'std.error'),
 conf_level = .95,        
 note = "note: Standard errors in parentheses",
 gof_omit = 'F|RMSE|R2$|AIC|Log.Lik.',
 title= 'DV: Corruption salience',
 coef_rename = coefs
 
 )
DV: Corruption salience

                     original          clustered         fixed effects
(Intercept)          4.754             4.754
                     [4.405, 5.104]    [3.898, 5.611]
                     (0.177)           (0.417)
Party in government  -0.767            -0.767            -0.922
                     [-1.408, -0.126]  [-1.694, 0.160]   [-1.638, -0.205]
                     (0.325)           (0.452)           (0.349)
Num.Obs.             232               232               232
R2 Adj.              0.019             0.019             0.604
R2 Within                                                0.078
R2 Within Adj.                                           0.073
BIC                  1052.1            1046.7            954.3
Std.Errors                             by: country       by: country

note: Standard errors in parentheses


Fixed Effects Models: as mean-centering

Note that this really is equivalent to mean-centering both variables (although this is a little tricky for categorical variables)

demeaned_model<-ches|>
  group_by(country)|>
  mutate(corrupt_salience = corrupt_salience - mean(corrupt_salience),
         in_gov = as.numeric(in_gov == 'In government') - mean(as.numeric(in_gov == 'In government'))
         )|>
  ungroup()|>
  feols(corrupt_salience ~ 1 + in_gov, data=_, cluster=~country)

mlist<-list('original'= original_model, 
            'clustered'= clustered_model,
            'fixed effects'= fe_model,
            'de-meaned model' = demeaned_model
            )

modelsummary(mlist,
 estimate  = "{estimate}",  
  statistic = c("conf.int", 'std.error'),
 conf_level = .95,        
 note = "note: Standard errors in parentheses",
 gof_omit = 'F|RMSE|R2$|AIC|Log.Lik.',
 title= 'DV: Corruption salience',
 coef_rename = coefs)

Fixed Effects Models: as mean-centering

DV: Corruption salience

                     original          clustered         fixed effects     de-meaned model
(Intercept)          4.754             4.754                               0.000
                     [4.405, 5.104]    [3.898, 5.611]                      [0.000, 0.000]
                     (0.177)           (0.417)                             (0.000)
Party in government  -0.767            -0.767            -0.922            -0.922
                     [-1.408, -0.126]  [-1.694, 0.160]   [-1.638, -0.205]  [-1.638, -0.205]
                     (0.325)           (0.452)           (0.349)           (0.349)
Num.Obs.             232               232               232               232
R2 Adj.              0.019             0.019             0.604             0.074
R2 Within                                                0.078
R2 Within Adj.                                           0.073
BIC                  1052.1            1046.7            954.3             807.2
Std.Errors                             by: country       by: country       by: country

note: Standard errors in parentheses

Fixed Effects Models

  • Good: accounts for group-level characteristics, including things we never measured or anticipated would matter

  • Good: similar to aggregation or centering variables, but somewhat more flexible

  • Bad: we can only estimate effects of variables that vary within groups; anything constant at the group level (say, a country’s actual level of corruption) is absorbed by the fixed effects and can’t be estimated.

  • Bad: soaks up a lot of variance, which may be unnecessary if group-level confounding isn’t actually a concern.